1 What are these samples?

Phosphorylated ribosome-IP captured mRNA from mouse olfactory epithelium. One group of mice smelled acetophenone for an hour, the other group did not. Because ribosome phosphorylation only occurs in activated neurons, we expect to find odorant receptors responding to acetophenone by looking for over-represented mRNA transcripts in stimulated samples.

(For details, see http://www.nature.com/neuro/journal/v18/n10/abs/nn.4104.html)

2 Set up environment

Let’s set up the environment where we work. Let’s assume that you have a folder on your desktop called achems, and we will use that folder as our working directory. We will download our input file (the count table) to that folder.

If a library hasn’t been installed yet, you can install it using e.g. install.packages("ggplot2"). For bioconductor libraries, use biocLite instead, e.g.

source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
  • set working directory
  • download count table
  • load libraries
  • custom plot theme
setwd("~/Desktop/achems/")
system("curl -O https://raw.githubusercontent.com/Yue-Jiang/achems/master/count_table.tab")

library(ggplot2)
library(ggrepel)
library(dplyr)
library(tidyr)
library(edgeR)
library(grid)
library(gridExtra)
library(knitr)
library(superheat)
library(plotly)
library(clusterProfiler)
library(org.Mm.eg.db)

# just a theme for plotting! don't worry about it!
my_theme <- function(base_size=14) {
  (theme_bw(base_size = base_size)
    + theme(plot.title = element_text(face = "bold",
                                      size = rel(1.2), hjust = 0.5),
            text = element_text(),
            panel.background = element_rect(colour = NA),
            plot.background = element_rect(colour = NA),
            panel.border = element_rect(colour = NA),
            axis.title = element_text(face = "bold",size = rel(1)),
            axis.title.y = element_text(angle=90,vjust =2),
            axis.title.x = element_text(vjust = -0.2),
            axis.text = element_text(), 
            axis.ticks = element_line(),
            panel.grid.major = element_line(colour="#f0f0f0"),
            panel.grid.minor = element_blank(),
            legend.key = element_rect(colour = NA)
    )
  )
}

3 Read in count table

Differential expression analysis softwares usually ask that you have a table of your expression data. Each software may have slightly different requirements in the way they want their input data prepared. Usually we start with a data frame with expression levels in counts (integer).

  • Each row is a gene
  • Each column is a sample
count_df <- read.table("count_table.tab", sep="\t", header=TRUE, stringsAsFactors=FALSE)

This is what the count table looks like (showing top 10 rows, 23481 rows total).

Gene Ctrl_A Ctrl_B Ctrl_C Stim_A Stim_B Stim_C
0610005C13Rik 0 0 0 0 0 0
0610007N19Rik 82 59 21 36 32 15
0610007P14Rik 58 91 68 62 74 58
0610008F07Rik 0 0 0 0 0 0
0610009B14Rik 2 0 1 0 0 0
0610009B22Rik 271 577 328 253 444 325
0610009D07Rik 270 654 372 260 474 430
0610009L18Rik 32 22 14 18 24 16
0610009O20Rik 212 122 74 145 139 97
0610010B08Rik 1 8 5 1 11 12

EdgeR asks that your input is a matrix.

countData <- as.matrix(count_df[, -1])
rownames(countData) <- count_df$Gene
# filter out genes lowly expressed, this is an arbitrary cutoff
counts <- countData[ rowSums(countData) > 5, ]
y <- DGEList(counts=counts)

4 Make your design table

Design table specifies experiment conditions for your samples. In this experiment, we have 6 samples with treatment (control vs stimulated) and littermate information (A, B, C). We are interested in genes differentially expressed in stimulated samples as compared to the control.

targets <- data.frame("Sample"=c("Ctrl_A", "Ctrl_B", "Ctrl_C", "Stim_A", "Stim_B", "Stim_C"),
                        "Treatment"=c("control", "control", "control", "stimulated", "stimulated", "stimulated"),
                        "Litter"=c("A", "B", "C", "A", "B", "C"))
Litter <- factor(targets$Litter)
Treat <- factor(targets$Treatment, levels=c("control","stimulated"))

The targets table (the table you made, more human readable) looks like this.

Sample Treatment Litter
Ctrl_A control A
Ctrl_B control B
Ctrl_C control C
Stim_A stimulated A
Stim_B stimulated B
Stim_C stimulated C

5 Differential expression analysis

Let’s perform differential expression analysis using edgeR. Remember that we are only interested in the effect of treatment. The litter or batch effect should be included in the model as a random effect to absorb the variance explained by batch so that we get better sensitivity. If we didn’t have litter matched experiment design, we can only compare control and treatment as two homogeneous groups (which is also the most common use case, so let’s do it anyways).

design_nolitter <- model.matrix(~Treat)

The design looks like this.

(Intercept) Treatstimulated
1 0
1 0
1 0
1 1
1 1
1 1
y_nolitter <- estimateDisp(y, design_nolitter)
fit_nolitter <- glmFit(y_nolitter, design_nolitter)
lrt_nolitter <- glmLRT(fit_nolitter)
de_nolitter <- topTags(lrt_nolitter, n=dim(lrt_nolitter)[1])$table

Without using litter (batch) information, we only recover 6 DE genes (FDR < 0.05).

kable(de_nolitter[de_nolitter$FDR < 0.05, ])
logFC logCPM LR PValue FDR
Tmem19 -6.213243 3.498855 45.15677 0.00e+00 0.0000002
Tmem190 6.203127 3.387871 45.02833 0.00e+00 0.0000002
Olfr937 4.999412 3.561300 24.87306 6.00e-07 0.0035015
Olfr923 3.499782 5.023108 21.31618 3.90e-06 0.0167017
Olfr19 4.395690 4.797225 19.18782 1.18e-05 0.0399375
Olfr1047 4.155271 3.624147 18.87346 1.40e-05 0.0399375

Now let’s do it properly, by explicitly accounting for litter effects.

design <- model.matrix(~Litter + Treat)

Now the design looks like this.

(Intercept) LitterB LitterC Treatstimulated
1 0 0 0
1 1 0 0
1 0 1 0
1 0 0 1
1 1 0 1
1 0 1 1
y <- estimateGLMCommonDisp(y,design)
y <- estimateGLMTrendedDisp(y,design)
y <- estimateGLMTagwiseDisp(y,design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
de <- topTags(lrt,n=dim(lrt)[1])$table

Now we get better sensitivity!

# number of DE genes now
sum(de$FDR < 0.05)
## [1] 80

Here are the top 10 most DE genes.

kable(de[order(de$FDR)[1:10], ])
logFC logCPM LR PValue FDR
Olfr745 3.259074 6.488694 129.73623 0 0
Olfr376 3.294949 5.767395 113.52111 0 0
Ranbp10 6.446126 4.140624 84.22813 0 0
Olfr19 4.313269 4.793146 82.85546 0 0
Olfr749 2.752854 6.484576 77.86890 0 0
Tmem19 -6.199616 3.500925 74.81520 0 0
Tmem190 6.179645 3.379617 71.89756 0 0
Pcdh10 2.203870 6.019768 70.09725 0 0
Olfr30 2.634877 4.977079 66.96668 0 0
Olfr937 4.954899 3.554047 62.04721 0 0

6 Visualizing DE result

For clarity, let’s make a data frame to hold the data for plotting. The data frame will contain the following columns:

  • Gene name
  • log2 CPM (counts per million)
  • log2 fold change
  • FDR corrected p-value
  • An annotation column indicating whether this gene is an OR
  • Another annotation column indicating whether this gene is differentially expressed
de_df <- data.frame("Gene"=rownames(de),
                    "log2CPM"=de$logCPM,
                    "log2FoldChange"=de$logFC,
                    "FDR"=de$FDR) %>%
  # add an annotation column indicating whether this gene is an OR
  mutate(Annotation=case_when(grepl("^Olfr", .$Gene) ~ "Odorant receptor",
                              TRUE ~ "Other genes")) %>%
  # add another annotation column indicating whether gene is enriched / decreased / not DE
  mutate(DE=case_when(.$FDR <= 0.05 & .$log2FoldChange > 0 ~ "Enriched (p<=0.05)",
                      .$FDR <= 0.05 & .$log2FoldChange < 0 ~ "Decreased (p<=0.05)",
                      TRUE ~ "p>0.05"))

For the purpose of this study, we are primarily interested in differentially expressed ORs only. As such, we should restrict the FDR correction to OR genes only, otherwise we are losing some sensitivity.

or_de_df <- data.frame("Gene"=rownames(de),
                       "log2CPM"=de$logCPM,
                       "log2FoldChange"=de$logFC,
                       "pvalue"=de$PValue) %>%
  # only keep ORs (genes starting with Olfr)
  filter(grepl("^Olfr", Gene)) %>%
  # fill N/A adjusted p-values with 1's
  mutate(FDR=p.adjust(pvalue, method="BH")) %>%
  # add another annotation column indicating whether gene is enriched / decreased / not DE
  mutate(DE=case_when(.$FDR <= 0.05 & .$log2FoldChange > 0 ~ "Enriched (p<=0.05)",
                      TRUE ~ "Others"))

6.1 MA plot

An MA plot is plotting the log2 fold change vs mean expression level. We can customize the MA plot by highlighting genes of interest (for example, all odorant receptor genes).

p1 <- ggplot(de_df %>% arrange(desc(FDR)), aes(log2CPM, log2FoldChange, color=DE)) +
  geom_point(alpha=0.8, size=1) +
  scale_color_manual(values=c("forestgreen", "red", "grey80")) +
  my_theme()
p2 <- ggplot(de_df %>% arrange(desc(Annotation)), aes(log2CPM, log2FoldChange, color=Annotation)) +
  geom_point(alpha=0.8, size=1) +
  scale_color_manual(values=c("red", "grey80")) +
  my_theme()
grid.arrange(p1, p2, ncol=1)

6.2 Volcano plot

Another popular way to visualize DE analysis is to plot log scaled p-values against log scaled fold change. We can customize the volcano plot by labeling the ORs that are enriched in stimulated samples.

ggplot(or_de_df, aes(log2FoldChange, log10(FDR), color=DE)) +
  geom_point(alpha=0.8, size=1) +
  geom_text_repel(data=or_de_df %>% filter(DE == "Enriched (p<=0.05)"), aes(label=Gene), segment.colour="grey50", color="grey10") +
  scale_y_reverse() +
  scale_color_manual(values=c("red", "grey80")) +
  my_theme()

Well, that’s probably not that good an idea. Let’s label M72 (Olfr160) only.

ggplot(or_de_df, aes(log2FoldChange, log10(FDR), color=DE)) +
  geom_point(alpha=0.8, size=1) +
  geom_text_repel(data=or_de_df %>% filter(Gene == "Olfr160"),
                  label="M72",
                  min.segment.length=unit(0, 'lines'),
                  nudge_x=3,
                  segment.colour="grey50",
                  color="grey10") +
  scale_y_reverse() +
  scale_color_manual(values=c("red", "grey80")) +
  my_theme()

7 Gene ontology

Are certain pathways enriched in the DE genes?

de_genes <- as.character(filter(de_df, FDR <= 0.05)$Gene)
background_genes <- as.character(filter(de_df, logCPM >= log2(10))$Gene)
go <- enrichGO(gene=de_genes,
               universe=background_genes,
               OrgDb = org.Mm.eg.db, keytype = 'SYMBOL',
               ont = "BP",
               pAdjustMethod = "BH",
               pvalueCutoff = 0.1,
               qvalueCutoff = 0.1
               )
kable(go@result)
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:0007608 GO:0007608 sensory perception of smell 29/76 160/8158 0.0000000 0.0000000 0.0000000 Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 29
GO:0007606 GO:0007606 sensory perception of chemical stimulus 29/76 175/8158 0.0000000 0.0000000 0.0000000 Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 29
GO:0007186 GO:0007186 G-protein coupled receptor signaling pathway 29/76 272/8158 0.0000000 0.0000000 0.0000000 Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 29
GO:0007600 GO:0007600 sensory perception 29/76 279/8158 0.0000000 0.0000000 0.0000000 Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 29
GO:0050877 GO:0050877 neurological system process 29/76 407/8158 0.0000000 0.0000000 0.0000000 Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 29
GO:0050907 GO:0050907 detection of chemical stimulus involved in sensory perception 4/76 38/8158 0.0004039 0.0202602 0.0202602 Olfr376/Olfr19/Olfr397/Olfr430 4
GO:0009593 GO:0009593 detection of chemical stimulus 4/76 48/8158 0.0009923 0.0426686 0.0426686 Olfr376/Olfr19/Olfr397/Olfr430 4
GO:0050906 GO:0050906 detection of stimulus involved in sensory perception 4/76 54/8158 0.0015463 0.0581777 0.0581777 Olfr376/Olfr19/Olfr397/Olfr430 4

Since there are some redundancy in the GO terms, we can simply them.

sim_go <- simplify(go, cutoff = 0.7, by = "p.adjust", select_fun = min, measure = "Rel")
kable(sim_go@result)
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:0007608 GO:0007608 sensory perception of smell 29/76 160/8158 0 0 0 Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 29
GO:0007186 GO:0007186 G-protein coupled receptor signaling pathway 29/76 272/8158 0 0 0 Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 29

8 Clustering

Sometimes we want to do exploratory analysis by clustering samples based on RNA expression profiles and see if the clustering makes biological sense. Let’s see a few examples.

8.1 hierarchical clustering

Let’s cluster the samples based on OR expression. Well, they actually cluster by litter… So it’s important to account for litter effect as we did before!

orCountData <- countData[grepl("^Olfr", rownames(countData)) & rowSums(countData) > 5, ]
superheat::superheat(t(log1p(orCountData)),
                     pretty.order.rows = TRUE,
                     pretty.order.cols = TRUE,
                     left.label.size = 0.3,
                     left.label.text.size = 3,
                     left.label.text.alignment = "right",
                     left.label.col = "white",
                     # scale the matrix columns
                     scale = TRUE,
                     row.dendrogram = TRUE,
                     col.dendrogram = TRUE,
                     legend=FALSE)

8.2 PCA

Just to showcase some PCA visualization.

or_pca <- data.frame(prcomp(t(log1p(orCountData)), scale.=TRUE)$x)
plot_ly(or_pca, x = ~PC1, y = ~PC2, z= ~PC3, text = rownames(or_pca)) %>%
  add_markers() %>%
  add_text() %>%
  layout(showlegend = FALSE)

9 Session info

Keep note of what packages were used and their versions for future reference.

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Los_Angeles         
##  date     2017-04-23
## Packages ------------------------------------------------------------------
##  package         * version  date       source                             
##  annotate          1.50.0   2016-05-04 Bioconductor                       
##  AnnotationDbi   * 1.34.4   2016-07-08 Bioconductor                       
##  assertthat        0.1      2013-12-06 CRAN (R 3.3.0)                     
##  backports         1.0.4    2016-10-24 CRAN (R 3.3.0)                     
##  base64enc         0.1-3    2015-07-28 CRAN (R 3.3.1)                     
##  Biobase         * 2.32.0   2016-05-04 Bioconductor                       
##  BiocGenerics    * 0.18.0   2016-05-04 Bioconductor                       
##  clusterProfiler * 3.0.5    2016-08-19 Bioconductor                       
##  colorspace        1.2-6    2015-03-11 CRAN (R 3.3.0)                     
##  DBI               0.4-1    2016-05-08 CRAN (R 3.3.0)                     
##  devtools          1.12.0   2016-06-24 CRAN (R 3.3.0)                     
##  digest            0.6.12   2017-01-27 cran (@0.6.12)                     
##  DO.db             2.9      2017-04-22 Bioconductor                       
##  DOSE            * 2.10.7   2016-07-21 Bioconductor                       
##  dplyr           * 0.5.0    2016-06-24 CRAN (R 3.3.0)                     
##  edgeR           * 3.14.0   2016-05-04 Bioconductor                       
##  evaluate          0.10     2016-10-11 CRAN (R 3.3.1)                     
##  ggdendro          0.1-20   2016-04-27 CRAN (R 3.3.0)                     
##  ggplot2         * 2.2.1    2016-12-30 CRAN (R 3.3.2)                     
##  ggrepel         * 0.6.5    2016-11-24 CRAN (R 3.3.2)                     
##  GO.db             3.3.0    2017-04-22 Bioconductor                       
##  GOSemSim          1.30.3   2016-07-20 Bioconductor                       
##  graph             1.50.0   2016-05-04 Bioconductor                       
##  gridExtra       * 2.2.1    2016-02-29 CRAN (R 3.3.0)                     
##  GSEABase          1.34.1   2016-09-22 Bioconductor                       
##  gtable            0.2.0    2016-02-26 CRAN (R 3.3.0)                     
##  highr             0.6      2016-05-09 CRAN (R 3.3.0)                     
##  htmltools         0.3.5    2016-03-21 CRAN (R 3.3.0)                     
##  htmlwidgets       0.8      2016-11-09 CRAN (R 3.3.2)                     
##  httr              1.2.1    2016-07-03 CRAN (R 3.3.0)                     
##  igraph            1.0.1    2015-06-26 CRAN (R 3.3.0)                     
##  IRanges         * 2.6.1    2016-06-19 Bioconductor                       
##  jsonlite          1.0      2016-07-01 CRAN (R 3.3.1)                     
##  knitr           * 1.15.1   2016-11-22 CRAN (R 3.3.2)                     
##  labeling          0.3      2014-08-23 CRAN (R 3.3.0)                     
##  lattice           0.20-33  2015-07-14 CRAN (R 3.3.1)                     
##  lazyeval          0.2.0    2016-06-12 CRAN (R 3.3.0)                     
##  limma           * 3.28.21  2016-09-05 Bioconductor                       
##  magrittr          1.5      2014-11-22 CRAN (R 3.3.0)                     
##  MASS              7.3-45   2016-04-21 CRAN (R 3.3.1)                     
##  matrixStats       0.50.2   2016-04-24 CRAN (R 3.3.0)                     
##  memoise           1.0.0    2016-01-29 CRAN (R 3.3.0)                     
##  munsell           0.4.3    2016-02-13 CRAN (R 3.3.0)                     
##  org.Mm.eg.db    * 3.3.0    2017-04-22 Bioconductor                       
##  plotly          * 4.5.6    2016-11-12 CRAN (R 3.3.2)                     
##  plyr              1.8.4    2016-06-08 CRAN (R 3.3.0)                     
##  purrr             0.2.2    2016-06-18 CRAN (R 3.3.0)                     
##  qvalue            2.4.2    2016-05-16 Bioconductor                       
##  R6                2.1.2    2016-01-26 CRAN (R 3.3.0)                     
##  Rcpp              0.12.8   2016-11-17 cran (@0.12.8)                     
##  reshape2          1.4.2    2016-10-22 cran (@1.4.2)                      
##  rmarkdown         1.3      2016-12-21 CRAN (R 3.3.1)                     
##  rprojroot         1.1      2016-10-29 CRAN (R 3.3.0)                     
##  RSQLite           1.0.0    2014-10-25 CRAN (R 3.3.0)                     
##  S4Vectors       * 0.10.2   2016-07-08 Bioconductor                       
##  scales            0.4.1    2016-11-09 cran (@0.4.1)                      
##  SparseM           1.74     2016-11-10 cran (@1.74)                       
##  stringi           1.1.2    2016-10-01 cran (@1.1.2)                      
##  stringr           1.1.0    2016-08-19 CRAN (R 3.3.0)                     
##  superheat       * 0.1.0    2017-03-07 Github (rlbarter/superheat@88ed789)
##  tibble            1.2      2016-08-26 cran (@1.2)                        
##  tidyr           * 0.6.0    2016-08-12 CRAN (R 3.3.0)                     
##  topGO             2.24.0   2016-05-04 Bioconductor                       
##  viridisLite       0.1.3    2016-03-12 CRAN (R 3.3.0)                     
##  withr             1.0.2    2016-06-20 CRAN (R 3.3.0)                     
##  XML               3.98-1.4 2016-03-01 CRAN (R 3.3.0)                     
##  xtable            1.8-2    2016-02-05 CRAN (R 3.3.0)                     
##  yaml              2.1.13   2014-06-12 CRAN (R 3.3.0)